House Price Prediction

The House Price Prediction project below shows how different machine learning techniques can be used to analyze the available data and then predict house prices with regression models.

An initial Exploratory Data Analysis (EDA) examines the main characteristics of the dataset and the relations between the target variable and the independent variables. A location analysis also gives a sense of where the data was recorded. The EDA leads into a Feature Engineering phase, where the data is cleaned and transformed so that it is ready to be used as input for the models to be tested.

Finally, the data is split into a train set and a test set, evaluation metrics are defined, and four regression techniques are applied. Their results are compared to pick the best model, which then generates predictions for the test sample.

Packages Loaded

R provides all the libraries needed for this project, such as data.table, ggplot2, MASS and randomForest, among others. They cover loading the data and converting it to a data frame, plotting graphs to make the exploratory data analysis more visual, and building the models.

A list of libraries is installed and then loaded so they can be used throughout the project.
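The installation and loading code is not shown in the rendered report. A minimal sketch of one common pattern follows. The package vector and the helper name `missing_packages` are assumptions made for illustration; the list only covers libraries that are named in the text or whose calls appear later in the report.

```r
# Assumed package list: libraries named in the text plus those whose
# functions or output appear later in this report.
pkgs <- c("data.table", "ggplot2", "MASS", "randomForest",
          "lubridate", "caret", "glmnet", "xgboost")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(p) {
  p[!vapply(p, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install from CRAN
}

# Attach the packages that are available so their functions can be called directly.
for (p in setdiff(pkgs, to_install)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Keeping the installation line commented out avoids reinstalling packages every time the report is knitted.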

Data Loading and Initial Preparation

Two CSV files were given in order to complete the project: house_price_train.csv and house_price_test.csv

Train

train_data<-fread('house_price_train.csv', stringsAsFactors = F)

cat('The Dimensions of the Train set are: ', dim(train_data))
## The Dimensions of the Train set are:  17277 21
str(train_data)
## Classes 'data.table' and 'data.frame':   17277 obs. of  21 variables:
##  $ id           :integer64 9183703376 464000600 2224079050 6163901283 6392003810 7974200948 2426059124 2115510300 ... 
##  $ date         : chr  "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(train_data)
##        id                 date               price        
##  Min.   :   1000102   Length:17277       Min.   :  78000  
##  1st Qu.:2113701080   Class :character   1st Qu.: 320000  
##  Median :3902100205   Mode  :character   Median : 450000  
##  Mean   :4566440237                      Mean   : 539865  
##  3rd Qu.:7302900090                      3rd Qu.: 645500  
##  Max.   :9900000190                      Max.   :7700000  
##     bedrooms        bathrooms      sqft_living       sqft_lot      
##  Min.   : 1.000   Min.   :0.500   Min.   :  370   Min.   :    520  
##  1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1430   1st Qu.:   5050  
##  Median : 3.000   Median :2.250   Median : 1910   Median :   7620  
##  Mean   : 3.369   Mean   :2.114   Mean   : 2080   Mean   :  15186  
##  3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10695  
##  Max.   :33.000   Max.   :8.000   Max.   :13540   Max.   :1164794  
##      floors        waterfront            view          condition    
##  Min.   :1.000   Min.   :0.000000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :1.500   Median :0.000000   Median :0.0000   Median :3.000  
##  Mean   :1.493   Mean   :0.007467   Mean   :0.2335   Mean   :3.413  
##  3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000  
##  Max.   :3.500   Max.   :1.000000   Max.   :4.0000   Max.   :5.000  
##      grade         sqft_above   sqft_basement       yr_built   
##  Min.   : 3.00   Min.   : 370   Min.   :   0.0   Min.   :1900  
##  1st Qu.: 7.00   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
##  Median : 7.00   Median :1564   Median :   0.0   Median :1975  
##  Mean   : 7.66   Mean   :1791   Mean   : 289.4   Mean   :1971  
##  3rd Qu.: 8.00   3rd Qu.:2210   3rd Qu.: 556.0   3rd Qu.:1997  
##  Max.   :13.00   Max.   :9410   Max.   :4820.0   Max.   :2015  
##   yr_renovated        zipcode           lat             long       
##  Min.   :   0.00   Min.   :98001   Min.   :47.16   Min.   :-122.5  
##  1st Qu.:   0.00   1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3  
##  Median :   0.00   Median :98065   Median :47.57   Median :-122.2  
##  Mean   :  85.35   Mean   :98078   Mean   :47.56   Mean   :-122.2  
##  3rd Qu.:   0.00   3rd Qu.:98117   3rd Qu.:47.68   3rd Qu.:-122.1  
##  Max.   :2015.00   Max.   :98199   Max.   :47.78   Max.   :-121.3  
##  sqft_living15    sqft_lot15    
##  Min.   : 460   Min.   :   659  
##  1st Qu.:1490   1st Qu.:  5100  
##  Median :1840   Median :  7639  
##  Mean   :1986   Mean   : 12826  
##  3rd Qu.:2360   3rd Qu.: 10080  
##  Max.   :6210   Max.   :871200
head(train_data,3)
##            id      date  price bedrooms bathrooms sqft_living sqft_lot
## 1: 9183703376 5/13/2014 225000        3       1.5        1250     7500
## 2:  464000600 8/27/2014 641250        3       2.5        2220     2550
## 3: 2224079050 7/18/2014 810000        4       3.5        3980   209523
##    floors waterfront view condition grade sqft_above sqft_basement
## 1:      1          0    0         3     7       1250             0
## 2:      3          0    2         3    10       2220             0
## 3:      2          0    2         3     9       3980             0
##    yr_built yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1:     1967            0   98030 47.3719 -122.215          1260       7563
## 2:     1990            0   98117 47.6963 -122.393          2200       5610
## 3:     2006            0   98024 47.5574 -121.890          2220      65775
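# Drop the identifier: it is not used as a model input
train_data$id <- NULL
# Move price to the last column by removing it and appending it again
price <- train_data$price
train_data$price <- NULL
train_data$price <- price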

Test

test_data<-fread('house_price_test.csv', stringsAsFactors = F)
cat('The Dimensions of the Test set are: ', dim(test_data))
## The Dimensions of the Test set are:  4320 20
str(test_data)
## Classes 'data.table' and 'data.frame':   4320 obs. of  20 variables:
##  $ id           :integer64 6414100192 6054650070 16000397 2524049179 8562750320 7589200193 9547205180 1432701230 ... 
##  $ date         : chr  "12/9/2014" "10/7/2014" "12/5/2014" "8/26/2014" ...
##  $ bedrooms     : int  3 3 2 3 3 3 3 3 3 5 ...
##  $ bathrooms    : num  2.25 1.75 1 2.75 2.5 1 2.5 1 2.5 2.5 ...
##  $ sqft_living  : int  2570 1370 1200 3050 2320 1090 2300 1280 3160 3150 ...
##  $ sqft_lot     : int  7242 9680 9850 44867 3980 3000 3060 9656 13603 9134 ...
##  $ floors       : num  2 1 1 1 2 1.5 1.5 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 4 0 0 0 0 0 0 ...
##  $ condition    : int  3 4 4 3 3 4 3 4 3 4 ...
##  $ grade        : int  7 7 7 9 8 8 8 6 8 8 ...
##  $ sqft_above   : int  2170 1370 1200 2330 2320 1090 1510 920 3160 1640 ...
##  $ sqft_basement: int  400 0 0 720 0 0 790 360 0 1510 ...
##  $ yr_built     : int  1951 1977 1921 1968 2003 1929 1930 1959 2003 1966 ...
##  $ yr_renovated : int  1991 0 0 0 0 0 2002 0 0 0 ...
##  $ zipcode      : int  98125 98074 98002 98040 98027 98117 98115 98058 98019 98056 ...
##  $ lat          : num  47.7 47.6 47.3 47.5 47.5 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1690 1370 1060 4110 2580 1570 1590 1340 3050 1990 ...
##  $ sqft_lot15   : int  7639 10208 5095 20336 3980 5080 3264 8808 9232 9133 ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(test_data,3)
##            id      date bedrooms bathrooms sqft_living sqft_lot floors
## 1: 6414100192 12/9/2014        3      2.25        2570     7242      2
## 2: 6054650070 10/7/2014        3      1.75        1370     9680      1
## 3:   16000397 12/5/2014        2      1.00        1200     9850      1
##    waterfront view condition grade sqft_above sqft_basement yr_built
## 1:          0    0         3     7       2170           400     1951
## 2:          0    0         4     7       1370             0     1977
## 3:          0    0         4     7       1200             0     1921
##    yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1:         1991   98125 47.7210 -122.319          1690       7639
## 2:            0   98074 47.6127 -122.045          1370      10208
## 3:            0   98002 47.3089 -122.210          1060       5095
# Keep the ids for the prediction file, then drop them from the features
test_labels <- test_data$id
test_data$id <- NULL

EDA - Basic Exploration

Missing Values and Duplicated Rows

The first step is to check whether the datasets contain missing values or duplicated rows so that they can be removed.

## The number of missing values on TRAIN are 0
## The number of missing values on TEST are 0
## The number of duplicated rows on TRAIN are 0
## The number of duplicated rows on TEST are 0

As both datasets do not have any missing values or duplicated rows, the next step is to analyze the data with basic data visualization plots.
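The code behind these checks is not shown in the report. The sketch below illustrates how such counts can be obtained with base R; the data frame `toy` and its values are made up for the example and stand in for train_data or test_data.

```r
# Toy data frame standing in for train_data (values are made up).
toy <- data.frame(
  bedrooms = c(3, 3, 4, NA),
  price    = c(225000, 225000, 810000, 330000)
)

n_missing    <- sum(is.na(toy))        # total NA cells across all columns
n_duplicated <- sum(duplicated(toy))   # rows identical to an earlier row

cat("The number of missing values are", n_missing, "\n")
cat("The number of duplicated rows are", n_duplicated, "\n")

# If either count were non-zero, both problems could be removed in one step.
toy_clean <- unique(na.omit(toy))
```

In the toy data one cell is missing and one row is a duplicate, so `toy_clean` keeps two of the four rows.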

Target Variable

The target variable to predict is price. A summary of this continuous variable and a distribution plot give an initial description.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78000  320000  450000  539865  645500 7700000

Looking at the graph, the distribution of the target variable is right-skewed, and some outliers can be seen.
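The summary statistics above already point in the same direction as the plot. The sketch below checks this with two simple rules. The use of Tukey's 1.5 × IQR fence is an assumption made here for illustration; the report does not say how outliers were identified.

```r
# Summary statistics of price as reported above.
price_summary <- c(min = 78000, q1 = 320000, median = 450000,
                   mean = 539865, q3 = 645500, max = 7700000)

# A mean well above the median is a quick sign of right skew.
price_summary[["mean"]] > price_summary[["median"]]   # TRUE

# Tukey's rule (assumed here): values above Q3 + 1.5 * IQR are potential outliers.
iqr         <- price_summary[["q3"]] - price_summary[["q1"]]
upper_fence <- price_summary[["q3"]] + 1.5 * iqr
upper_fence                                           # 1133750

# The maximum price lies far beyond the fence.
price_summary[["max"]] > upper_fence                  # TRUE
```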

Location Analysis

Using the coordinates of each house (longitude and latitude), a visual representation was built to show where the data comes from (Seattle), with the houses separated into clusters based on price ranges.

## Assuming "long" and "lat" are longitude and latitude, respectively

Evolution of Prices Over Time

One of the main analyses is to understand how prices evolve over time, and how many houses were built in each year, to reveal relations with time or with specific events that could have had an impact.

The data available covers sales from 2014 and 2015. In the first time series plot, prices over these two years are not stationary, which suggests that past data may help explain future values, so regression could be used to predict future prices.

The dataset also records details of each house sold, such as the year it was built. Using this information, two plots show how prices vary with the year of construction and how many houses were sold for each construction year. The curve shows prices declining and then rising again; this could be related to specific events in the 1960s, such as the Cold War.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
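The aggregation behind plots of price and number of sales by construction year is not shown in the report. One possible way to compute it with base R is sketched below; the data frame `toy` and its values are invented for the example.

```r
# Toy sales records (made-up values) with the two columns the plots rely on.
toy <- data.frame(
  yr_built = c(1951, 1951, 1967, 1990, 1990, 1990),
  price    = c(530000, 470000, 225000, 641250, 600000, 700000)
)

# Mean sale price per construction year.
avg_price <- aggregate(price ~ yr_built, data = toy, FUN = mean)

# Number of houses sold per construction year.
n_sold <- aggregate(price ~ yr_built, data = toy, FUN = length)
names(n_sold)[2] <- "houses"

# One row per construction year: yr_built, mean price, number of houses.
by_year <- merge(avg_price, n_sold, by = "yr_built")
by_year
```

A table like `by_year` can be passed directly to a plotting function, with yr_built on the x axis.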

Discrete Variables Distribution

All variables in the datasets are numerical except for the date. To show the distributions more clearly, the variables were split into discrete and continuous groups so that a suitable chart type could be used for each (bar charts for discrete variables, density charts for continuous ones).
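The report does not state the rule used to separate discrete from continuous variables. The sketch below assumes a simple one: a variable with few distinct values is treated as discrete. The cutoff `max_levels` and the toy data are hypothetical.

```r
# Toy data (made-up values) with two discrete and two continuous columns.
toy <- data.frame(
  bedrooms    = c(3, 3, 4, 4, 2, 3),
  waterfront  = c(0, 0, 0, 1, 0, 0),
  sqft_living = c(1250, 2220, 3980, 1890, 1814, 3120),
  price       = c(225000, 641250, 810000, 330000, 530000, 620000)
)

# Assumed rule: few distinct values means discrete. The cutoff is hypothetical.
max_levels <- 4
n_unique   <- vapply(toy, function(x) length(unique(x)), integer(1))

discrete_vars   <- names(n_unique)[n_unique <= max_levels]
continuous_vars <- names(n_unique)[n_unique >  max_levels]

discrete_vars     # "bedrooms"    "waterfront"
continuous_vars   # "sqft_living" "price"
```

On the real data a larger cutoff would be needed, since variables such as grade or bedrooms take more than four distinct values.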

Continuous Variables Distribution

Most of the distributions are skewed to the right.

Relation Between Target Variable and Few Independent Variables

Plotting price against some of the independent variables reveals outliers and hints at how the variables are related, which gives initial insights for the feature engineering phase.

The graphs above clearly show outliers and suggest some correlation between the variables, which is checked in the next step with a proper correlation analysis.

Correlation Analysis

A general correlation plot compares all the variables against the target and against each other. A smaller correlation plot, restricted to the high correlations, helps to show which variables are most strongly correlated and how to treat them later.

Highly correlated variables are usually considered to be those with a coefficient higher than 0.80, but in this case a threshold of 0.5 was used.
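The filtering step can be sketched with base R as follows. The data is simulated (it is not the house price dataset), and the variable names only mirror those of the report for readability.

```r
set.seed(1)
n <- 200
sqft_living <- rnorm(n, 2000, 800)
toy <- data.frame(
  sqft_living = sqft_living,
  sqft_above  = sqft_living * 0.8 + rnorm(n, 0, 150),   # built to correlate
  condition   = sample(1:5, n, replace = TRUE),         # unrelated by construction
  price       = sqft_living * 250 + rnorm(n, 0, 100000)
)

# Pearson correlation between every pair of columns.
cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and filter by the 0.5 threshold.
idx  <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.5, arr.ind = TRUE)
high <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2)
)

# Strongest pairs first.
high[order(-abs(high$r)), ]
```

Pairs of predictors that appear in this list are candidates for dropping one of the two variables, which is the reasoning applied to sqft_above later in the report.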

Feature Engineering

Datasets Combination

The train and test datasets provided are combined so that the same feature engineering is applied to both.

A new variable, ‘isTrain’, makes it possible to split them again as they were originally given.

train <- train_data
test <- test_data
# The test set has no target; add an empty price column so both tables have the same columns
test$price <- NA

str(train)
## Classes 'data.table' and 'data.frame':   17277 obs. of  20 variables:
##  $ date         : chr  "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame':   4320 obs. of  20 variables:
##  $ date         : chr  "12/9/2014" "10/7/2014" "12/5/2014" "8/26/2014" ...
##  $ bedrooms     : int  3 3 2 3 3 3 3 3 3 5 ...
##  $ bathrooms    : num  2.25 1.75 1 2.75 2.5 1 2.5 1 2.5 2.5 ...
##  $ sqft_living  : int  2570 1370 1200 3050 2320 1090 2300 1280 3160 3150 ...
##  $ sqft_lot     : int  7242 9680 9850 44867 3980 3000 3060 9656 13603 9134 ...
##  $ floors       : num  2 1 1 1 2 1.5 1.5 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 4 0 0 0 0 0 0 ...
##  $ condition    : int  3 4 4 3 3 4 3 4 3 4 ...
##  $ grade        : int  7 7 7 9 8 8 8 6 8 8 ...
##  $ sqft_above   : int  2170 1370 1200 2330 2320 1090 1510 920 3160 1640 ...
##  $ sqft_basement: int  400 0 0 720 0 0 790 360 0 1510 ...
##  $ yr_built     : int  1951 1977 1921 1968 2003 1929 1930 1959 2003 1966 ...
##  $ yr_renovated : int  1991 0 0 0 0 0 2002 0 0 0 ...
##  $ zipcode      : int  98125 98074 98002 98040 98027 98117 98115 98058 98019 98056 ...
##  $ lat          : num  47.7 47.6 47.3 47.5 47.5 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1690 1370 1060 4110 2580 1570 1590 1340 3050 1990 ...
##  $ sqft_lot15   : int  7639 10208 5095 20336 3980 5080 3264 8808 9232 9133 ...
##  $ price        : logi  NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
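# Flag the origin of each row so the two sets can be separated again later
train$isTrain <- 1
test$isTrain <- 0

# Stack both tables into a single one
house_base <- rbind(train,test)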
str(house_base)
## Classes 'data.table' and 'data.frame':   21597 obs. of  21 variables:
##  $ date         : chr  "5/13/2014" "8/27/2014" "7/18/2014" "1/30/2015" ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ isTrain      : num  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Feature Creation

New variables were created, such as the age of a house and whether or not it was renovated. Some features, such as date, latitude and longitude, will not be used as input for the models. Because of the high correlations found earlier, sqft_above is dropped as well: it has a coefficient of 0.88 with sqft_living.

##### **NEW FEATURES** #####
library(lubridate)
# Age in years relative to the run date (the output below was produced in 2019)
house_base$houseAge <- year(Sys.time()) - house_base$yr_built
# 1 if the house was ever renovated (yr_renovated is 0 when it was not)
house_base$renovated <- ifelse(house_base$yr_renovated == 0, 0, 1)
# Drop the columns that are not used as model inputs
house_base[, c('date','sqft_above', 'lat', 'long') := NULL]
str(house_base)
## Classes 'data.table' and 'data.frame':   21597 obs. of  19 variables:
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ isTrain      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ houseAge     : num  52 29 13 52 68 11 24 36 32 60 ...
##  $ renovated    : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(house_base,3)
##    bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1:        3       1.5        1250     7500      1          0    0
## 2:        3       2.5        2220     2550      3          0    2
## 3:        4       3.5        3980   209523      2          0    2
##    condition grade sqft_basement yr_built yr_renovated zipcode
## 1:         3     7             0     1967            0   98030
## 2:         3    10             0     1990            0   98117
## 3:         3     9             0     2006            0   98024
##    sqft_living15 sqft_lot15  price isTrain houseAge renovated
## 1:          1260       7563 225000       1       52         0
## 2:          2200       5610 641250       1       29         0
## 3:          2220      65775 810000       1       13         0

Data Normalization and Outliers Treatment

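The numeric features were centered and scaled so that they are on a comparable scale for the models in the next stage. The columns price, isTrain, renovated, zipcode, yr_built and yr_renovated were set aside and left unscaled.

Outliers are kept in this case, but they could be removed in future improvements of the model.

The final dataset, ready to be used for modeling, is: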

# Set aside the columns that should not be scaled
price <- house_base$price
isTrain <- house_base$isTrain
renovated <- house_base$renovated
yr_renovated <- house_base$yr_renovated
yr_built <- house_base$yr_built
zipcode <- house_base$zipcode
house_base[, c('price', 'isTrain','renovated', 'zipcode', 'yr_renovated','yr_built'):=NULL]

#NORMALIZING
# preProcess() (caret package) learns the mean and standard deviation of each column
num <- preProcess(house_base, method=c("center", "scale"))

# Apply the transformation, then re-attach the unscaled columns
house_base <- predict(num, house_base)
house_final <- cbind(house_base,renovated, zipcode, yr_built, yr_renovated, isTrain, price)

str(house_final)
## Classes 'data.table' and 'data.frame':   21597 obs. of  19 variables:
##  $ bedrooms     : num  -0.403 -0.403 0.677 0.677 0.677 ...
##  $ bathrooms    : num  -0.801 0.5 1.8 -0.801 -0.476 ...
##  $ sqft_living  : num  -0.904 0.152 2.069 -0.207 -0.29 ...
##  $ sqft_lot     : num  -0.184 -0.303 4.695 -0.183 -0.244 ...
##  $ floors       : num  -0.916 2.79 0.937 -0.916 -0.916 ...
##  $ waterfront   : num  -0.0872 -0.0872 -0.0872 -0.0872 -0.0872 ...
##  $ view         : num  -0.306 2.304 2.304 -0.306 -0.306 ...
##  $ condition    : num  -0.63 -0.63 -0.63 0.907 0.907 ...
##  $ grade        : num  -0.561 1.996 1.144 -0.561 -0.561 ...
##  $ sqft_basement: num  -0.659 -0.659 -0.659 -0.659 1.306 ...
##  $ sqft_living15: num  -1.06 0.311 0.341 -0.141 -1.017 ...
##  $ sqft_lot15   : num  -0.19 -0.262 1.944 -0.156 -0.284 ...
##  $ houseAge     : num  0.136 -0.647 -1.191 0.136 0.681 ...
##  $ renovated    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ isTrain      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(house_final)
##     bedrooms         bathrooms        sqft_living         sqft_lot      
##  Min.   :-2.5620   Min.   :-2.1012   Min.   :-1.8629   Min.   :-0.3520  
##  1st Qu.:-0.4029   1st Qu.:-0.4757   1st Qu.:-0.7083   1st Qu.:-0.2429  
##  Median :-0.4029   Median : 0.1745   Median :-0.1855   Median :-0.1807  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6767   3rd Qu.: 0.4996   3rd Qu.: 0.5116   3rd Qu.:-0.1066  
##  Max.   :31.9841   Max.   : 7.6519   Max.   :12.4819   Max.   :39.5111  
##                                                                         
##      floors           waterfront           view           condition      
##  Min.   :-0.91553   Min.   :-0.0872   Min.   :-0.3057   Min.   :-3.7043  
##  1st Qu.:-0.91553   1st Qu.:-0.0872   1st Qu.:-0.3057   1st Qu.:-0.6300  
##  Median : 0.01094   Median :-0.0872   Median :-0.3057   Median :-0.6300  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.93741   3rd Qu.:-0.0872   3rd Qu.:-0.3057   3rd Qu.: 0.9072  
##  Max.   : 3.71682   Max.   :11.4669   Max.   : 4.9136   Max.   : 2.4444  
##                                                                          
##      grade         sqft_basement    sqft_living15       sqft_lot15      
##  Min.   :-3.9703   Min.   :-0.659   Min.   :-2.3169   Min.   :-0.44391  
##  1st Qu.:-0.5608   1st Qu.:-0.659   1st Qu.:-0.7247   1st Qu.:-0.28079  
##  Median :-0.5608   Median :-0.659   Median :-0.2140   Median :-0.18839  
##  Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.2916   3rd Qu.: 0.606   3rd Qu.: 0.5449   3rd Qu.:-0.09809  
##  Max.   : 4.5534   Max.   :10.230   Max.   : 6.1634   Max.   :31.47422  
##                                                                         
##     houseAge         renovated          zipcode         yr_built   
##  Min.   :-1.4979   Min.   :0.00000   Min.   :98001   Min.   :1900  
##  1st Qu.:-0.8851   1st Qu.:0.00000   1st Qu.:98033   1st Qu.:1951  
##  Median :-0.1362   Median :0.00000   Median :98065   Median :1975  
##  Mean   : 0.0000   Mean   :0.04232   Mean   :98078   Mean   :1971  
##  3rd Qu.: 0.6808   3rd Qu.:0.00000   3rd Qu.:98118   3rd Qu.:1997  
##  Max.   : 2.4170   Max.   :1.00000   Max.   :98199   Max.   :2015  
##                                                                    
##   yr_renovated        isTrain        price        
##  Min.   :   0.00   Min.   :0.0   Min.   :  78000  
##  1st Qu.:   0.00   1st Qu.:1.0   1st Qu.: 320000  
##  Median :   0.00   Median :1.0   Median : 450000  
##  Mean   :  84.46   Mean   :0.8   Mean   : 539865  
##  3rd Qu.:   0.00   3rd Qu.:1.0   3rd Qu.: 645500  
##  Max.   :2015.00   Max.   :1.0   Max.   :7700000  
##                                  NA's   :4320
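Centering and scaling replaces each value by its z-score, which is why every scaled column in the summary above has a mean of 0. The sketch below reproduces the calculation with base R on a handful of values; it illustrates the transformation and is not the caret call used in the report.

```r
# First five sqft_living values from the train set shown earlier.
sqft_living <- c(1250, 2220, 3980, 1890, 1814)

# z-score: subtract the mean, then divide by the standard deviation.
z_manual <- (sqft_living - mean(sqft_living)) / sd(sqft_living)

# scale() performs the same centering and scaling on each column of a matrix.
z_scale <- as.numeric(scale(sqft_living))

all.equal(z_manual, z_scale)   # TRUE
round(mean(z_manual), 10)      # 0
sd(z_manual)                   # 1
```

The z-scores here differ from the values in the output above because they use the mean and standard deviation of five rows only, not of the full dataset.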

Data Splitting (Train - Test Sets)

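The house_final dataset is first split back into the original train and test sets using isTrain, which recovers the original row counts (17,277 and 4,320). Because the test set has no prices, the train set is then split again at random: 75% of its rows (12,957) are used to fit the models and the remaining 25% (4,320) are held out to evaluate them.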

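# Recover the original train and test sets
train_model <- house_final[house_final$isTrain==1,]
test_model <- house_final[house_final$isTrain==0,]
# 75% of the labelled rows are used for fitting
smp_size <- floor(0.75 * nrow(train_model))

# Fix the seed so the random split is reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(train_model)), size = smp_size)

train_new <- train_model[train_ind, ]
# The remaining 25% is held out for evaluation
test_new <- train_model[-train_ind, ]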
nrow(train_new)
## [1] 12957
nrow(test_new)
## [1] 4320
train_new$isTrain <- NULL
test_new$isTrain <- NULL

Modeling

Before applying the models, a single formula is defined for all of them: price is predicted from all the remaining independent variables.

Some metrics are also defined; they are used to evaluate the outcome of each model and, in a final comparison, to pick the best one.

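#### FORMULA
# price is modeled as a function of every other column
formula<-as.formula(price~.) 

#METRICS
# Mean Absolute Percentage Error, returned as a fraction (0.29 means 29%)
mape<-function(real,predicted){return(mean(abs((real-predicted)/real)))}
# Mean Absolute Error, in the units of price
mae<-function(real,predicted){return(mean(abs(real-predicted)))}
# Root Mean Squared Error, in the units of price
rmse<-function(real,predicted){return(sqrt(mean((real-predicted)^2)))}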
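A small worked example helps to read the three metrics. The functions are repeated here so the snippet runs on its own; the vectors `real` and `predicted` are made-up values.

```r
mape <- function(real, predicted) mean(abs((real - predicted) / real))
mae  <- function(real, predicted) mean(abs(real - predicted))
rmse <- function(real, predicted) sqrt(mean((real - predicted)^2))

real      <- c(100, 200)
predicted <- c(110, 180)

# Errors are -10 and 20, i.e. 10% of the real value in both cases.
mape(real, predicted)   # 0.1
mae(real, predicted)    # 15
rmse(real, predicted)   # 15.81139
```

RMSE is larger than MAE because squaring gives more weight to the bigger error, which is why the two metrics can rank models differently when outliers are present.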

Regression with Regularization

## 
## Call:  glmnet(x = data.matrix(train_new[, !"price"]), y = train_new[["price"]],      family = "gaussian", alpha = 1, lambda = lasso_cv$lambda.min) 
## 
##      Df   %Dev Lambda
## [1,] 17 0.6657  598.4
## 17 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## bedrooms      -28227.731784
## bathrooms      28556.726059
## sqft_living   137820.947906
## sqft_lot        -212.101496
## floors         17427.876046
## waterfront     46832.914208
## view           34754.358019
## condition      14412.895924
## grade         145345.810730
## sqft_basement   5265.698159
## sqft_living15  16770.062405
## sqft_lot15    -14116.564072
## houseAge      101720.540892
## renovated      26464.991208
## zipcode           14.455318
## yr_built         -10.511604
## yr_renovated       4.099015

##          RMSE    MAE MAPE
## [1,] 231312.4 139965 29.1

Random Forest

##          RMSE      MAE MAPE
## [1,] 184895.8 100685.6   20

Regression With Feature Selection (Stepwise)

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + grade + sqft_basement + sqft_living15 + 
##     sqft_lot15 + houseAge + renovated + yr_renovated, data = train_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1223760  -108376   -11288    90776  4371207 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     538276.4     1872.6 287.456  < 2e-16 ***
## bedrooms        -29969.3     2274.0 -13.179  < 2e-16 ***
## bathrooms        28785.0     3368.7   8.545  < 2e-16 ***
## sqft_living     138529.9     4391.1  31.548  < 2e-16 ***
## floors           18943.1     2529.0   7.490 7.32e-14 ***
## waterfront       47567.6     2012.6  23.635  < 2e-16 ***
## view             34869.5     2150.4  16.215  < 2e-16 ***
## condition        15389.0     2018.9   7.622 2.66e-14 ***
## grade           144732.0     3312.7  43.691  < 2e-16 ***
## sqft_basement     5549.5     2475.9   2.241    0.025 *  
## sqft_living15    17699.7     3048.3   5.806 6.53e-09 ***
## sqft_lot15      -15144.4     2005.5  -7.552 4.59e-14 ***
## houseAge        103520.3     2597.5  39.854  < 2e-16 ***
## renovated     -4991309.2  1147050.4  -4.351 1.36e-05 ***
## yr_renovated      2518.7      574.7   4.383 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208000 on 12942 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6659 
## F-statistic:  1845 on 14 and 12942 DF,  p-value: < 2.2e-16

##          RMSE      MAE MAPE
## [1,] 230779.8 139745.5   29
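The code that produced this model is not shown. The output is consistent with stepwise selection: sqft_lot, zipcode and yr_built, which appear in the regularized model, are no longer in the formula. A minimal sketch of the technique with base R's `step()` follows, on simulated data; whether the report used `step()` or another function such as `MASS::stepAIC()` is an assumption.

```r
set.seed(42)
n <- 300
toy <- data.frame(
  sqft_living = rnorm(n),
  grade       = rnorm(n),
  noise       = rnorm(n)          # unrelated to price by construction
)
toy$price <- 500000 + 140000 * toy$sqft_living + 145000 * toy$grade +
  rnorm(n, sd = 50000)

# Start from the full model with every predictor.
full_model <- lm(price ~ ., data = toy)

# Add or drop one term at a time, keeping the change that lowers AIC the most,
# and stop when no single change improves AIC.
step_model <- step(full_model, direction = "both", trace = 0)

formula(step_model)
```

The strong predictors are kept; a term such as `noise` is dropped whenever removing it lowers the AIC.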

XGBoosting Tree

## [1]  train-rmse:477483.093750 
## [2]  train-rmse:359726.156250 
## [3]  train-rmse:281496.093750 
## [4]  train-rmse:228993.453125 
## [5]  train-rmse:196445.343750 
## [6]  train-rmse:174051.703125 
## [7]  train-rmse:159525.312500 
## [8]  train-rmse:150715.093750 
## [9]  train-rmse:140099.703125 
## [10] train-rmse:135083.375000 
## [11] train-rmse:131570.000000 
## [12] train-rmse:128801.375000 
## [13] train-rmse:126498.015625 
## [14] train-rmse:122762.968750 
## [15] train-rmse:119967.203125 
## [16] train-rmse:117354.500000 
## [17] train-rmse:116233.812500 
## [18] train-rmse:112307.132812 
## [19] train-rmse:111647.343750 
## [20] train-rmse:109219.843750 
## [21] train-rmse:108185.234375 
## [22] train-rmse:106265.015625 
## [23] train-rmse:104316.257812 
## [24] train-rmse:103114.523438 
## [25] train-rmse:102056.515625 
## [26] train-rmse:100778.070312 
## [27] train-rmse:99862.625000 
## [28] train-rmse:98468.093750 
## [29] train-rmse:97836.898438 
## [30] train-rmse:97140.187500 
## [31] train-rmse:96321.273438 
## [32] train-rmse:95929.156250 
## [33] train-rmse:94263.242188 
## [34] train-rmse:92995.500000 
## [35] train-rmse:91637.539062 
## [36] train-rmse:90815.960938 
## [37] train-rmse:90135.125000 
## [38] train-rmse:89385.468750 
## [39] train-rmse:88445.664062 
## [40] train-rmse:86862.632812 
## [41] train-rmse:86106.320312 
## [42] train-rmse:85549.445312 
## [43] train-rmse:85265.031250 
## [44] train-rmse:84441.882812 
## [45] train-rmse:83766.250000 
## [46] train-rmse:82960.976562 
## [47] train-rmse:82489.421875 
## [48] train-rmse:81461.023438 
## [49] train-rmse:81105.289062 
## [50] train-rmse:80669.539062 
## [51] train-rmse:80058.250000 
## [52] train-rmse:79767.085938 
## [53] train-rmse:78976.695312 
## [54] train-rmse:78341.156250 
## [55] train-rmse:77761.132812 
## [56] train-rmse:77004.453125 
## [57] train-rmse:76689.328125 
## [58] train-rmse:76288.867188 
## [59] train-rmse:75589.890625 
## [60] train-rmse:74811.695312

##          RMSE      MAE MAPE
## [1,] 156843.4 79998.87 15.1

Model Evaluation

The MAPE (Mean Absolute Percentage Error) is the main metric used to compare the results. In the table below it is expressed as a fraction, whereas the per-model outputs above show it as a percentage.

##    method     rmse       mae      mape
## 1: glmnet 231312.4 139964.95 0.2906808
## 2:     rf 184895.8 100685.59 0.2003794
## 3:     lm 230779.8 139745.52 0.2900942
## 4:    xgb 156843.4  79998.87 0.1508053

With the lowest MAPE (about 15%), the model used to predict the house prices is the XGBoosting Tree.
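The selection can also be made programmatically. The sketch below rebuilds the comparison table from the figures reported above and picks the method with the lowest MAPE; the object names `results` and `best` are chosen for illustration.

```r
# Hold-out metrics exactly as reported in the comparison table.
results <- data.frame(
  method = c("glmnet", "rf", "lm", "xgb"),
  rmse   = c(231312.4, 184895.8, 230779.8, 156843.4),
  mae    = c(139964.95, 100685.59, 139745.52, 79998.87),
  mape   = c(0.2906808, 0.2003794, 0.2900942, 0.1508053),
  stringsAsFactors = FALSE
)

# Rank by MAPE, the main metric, and take the first row.
results <- results[order(results$mape), ]
best    <- results$method[1]
best    # "xgb"
```

In this table the ranking is the same for RMSE and MAE, so the choice does not depend on which of the three metrics is used.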

Prediction File Generation with Selected Model

Finally, the prices for all the ids stored in test_labels are predicted and written to a .txt file.

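# One row per test id, with the predicted price rounded to whole units
pred <- data.frame(id=test_labels,price=round(df_predicted$test_xgb))
# Comma-separated output without row names
write.table(pred,file="House_Price_Pred_1.txt",row.names=FALSE, sep = ',')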

Conclusion

House prices were predicted using a machine learning process in which the data was first analyzed and presented visually to gain insights. Several techniques were then applied to clean and transform the data, which was split and used to build four models: Random Forest, Lasso Regression, Regression with Stepwise Feature Selection and XGBoosting Tree. The last had the lowest MAPE, so it was the final model used to predict the prices for the test dataset.

As future improvements, outliers could be removed and the model parameters could be tuned to see whether the results improve.